Sparse Gaussian Graphical Models with Discrete Optimization: Computational and Statistical Perspectives
Behdin, Kayhan, Chen, Wenyu, Mazumder, Rahul
We consider the problem of learning a sparse graph underlying an undirected Gaussian graphical model, a key problem in statistical machine learning. Given $n$ samples from a multivariate Gaussian distribution with $p$ variables, the goal is to estimate the $p \times p$ inverse covariance matrix (aka precision matrix), assuming it is sparse (i.e., has a few nonzero entries). We propose GraphL0BnB, a new estimator based on an $\ell_0$-penalized version of the pseudolikelihood function, while most earlier approaches are based on the $\ell_1$-relaxation. Our estimator can be formulated as a convex mixed integer program (MIP) which can be difficult to compute at scale using off-the-shelf commercial solvers. To solve the MIP, we propose a custom nonlinear branch-and-bound (BnB) framework that solves node relaxations with tailored first-order methods. As a by-product of our BnB framework, we propose large-scale solvers for obtaining good primal solutions that are of independent interest. We derive novel statistical guarantees (estimation and variable selection) for our estimator and discuss how our approach improves upon existing estimators. Our numerical experiments on real/synthetic datasets suggest that our method can solve, to near-optimality, problem instances with $p = 10^4$ -- corresponding to a symmetric matrix of size $p \times p$ with $p^2/2$ binary variables. We demonstrate the usefulness of GraphL0BnB versus various state-of-the-art approaches on a range of datasets.
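To make the form of the estimator concrete, here is a hedged sketch of one standard symmetric pseudolikelihood objective with an $\ell_0$ penalty (a CONCORD-style data-fit term); the notation ($X_j$ for the $j$-th column of the $n \times p$ data matrix, $\lambda_0$ for the penalty weight) is ours, and the exact GraphL0BnB objective may differ in details such as additional ridge regularization or a cardinality constraint.
\[ \hat{\Theta} \in \operatorname*{arg\,min}_{\Theta = \Theta^\top,\ \theta_{ii} > 0} \; -n \sum_{i=1}^{p} \log \theta_{ii} + \frac{1}{2} \sum_{i=1}^{p} \Big\| \sum_{j=1}^{p} \theta_{ij} X_j \Big\|_2^2 + \lambda_0 \sum_{i<j} \mathbb{1}\left[\theta_{ij} \neq 0\right] \]
Modeling each indicator with a binary variable $z_{ij} \in \{0,1\}$ linked to $\theta_{ij}$ through big-M or conic constraints is a standard way to arrive at a convex mixed integer program of the kind the abstract describes.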
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Research Report > Promising Solution (0.65)
- Research Report > New Finding (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)
A Fourier Approach to Mixture Learning
Qiao, Mingda, Guruganesh, Guru, Rawat, Ankit Singh, Dubey, Avinava, Zaheer, Manzil
We revisit the problem of learning mixtures of spherical Gaussians. Given samples from mixture $\frac{1}{k}\sum_{j=1}^{k}\mathcal{N}(\mu_j, I_d)$, the goal is to estimate the means $\mu_1, \mu_2, \ldots, \mu_k \in \mathbb{R}^d$ up to a small error. The hardness of this learning problem can be measured by the separation $\Delta$ defined as the minimum distance between all pairs of means. Regev and Vijayaraghavan (2017) showed that with $\Delta = \Omega(\sqrt{\log k})$ separation, the means can be learned using $\mathrm{poly}(k, d)$ samples, whereas super-polynomially many samples are required if $\Delta = o(\sqrt{\log k})$ and $d = \Omega(\log k)$. This leaves open the low-dimensional regime where $d = o(\log k)$. In this work, we give an algorithm that efficiently learns the means in $d = O(\log k/\log\log k)$ dimensions under separation $d/\sqrt{\log k}$ (modulo doubly logarithmic factors). This separation is strictly smaller than $\sqrt{\log k}$, and is also shown to be necessary. Along with the results of Regev and Vijayaraghavan (2017), our work almost pins down the critical separation threshold at which efficient parameter learning becomes possible for spherical Gaussian mixtures. More generally, our algorithm runs in time $\mathrm{poly}(k)\cdot f(d, \Delta, \epsilon)$, and is thus fixed-parameter tractable in parameters $d$, $\Delta$ and $\epsilon$. Our approach is based on estimating the Fourier transform of the mixture at carefully chosen frequencies, and both the algorithm and its analysis are simple and elementary. Our positive results can be easily extended to learning mixtures of non-Gaussian distributions, under a mild condition on the Fourier spectrum of the distribution.
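A minimal one-dimensional sketch of the core primitive in the abstract: estimating the Fourier transform (characteristic function) of the mixture from samples at chosen frequencies. The mixture, sample size, and frequency grid below are illustrative choices of ours, not the paper's actual schedule or algorithm.

import numpy as np

rng = np.random.default_rng(0)
k, n = 4, 5000
means = rng.normal(scale=3.0, size=k)        # unknown means mu_1, ..., mu_k
comp = rng.integers(k, size=n)               # uniform mixing weights 1/k
samples = means[comp] + rng.normal(size=n)   # N(mu_j, 1) samples

def empirical_cf(x, xi):
    """Estimate E[exp(i * xi * X)] at each frequency xi by averaging over the samples."""
    return np.mean(np.exp(1j * np.outer(xi, x)), axis=1)

xi = np.linspace(-5.0, 5.0, 101)             # illustrative frequency grid
cf_hat = empirical_cf(samples, xi)
# For a unit-variance mixture the true CF is exp(-xi^2/2) * (1/k) * sum_j exp(i*xi*mu_j);
# the means enter only through the oscillatory factor, which is what the algorithm
# inverts at carefully chosen frequencies.
cf_true = np.exp(-xi**2 / 2) * np.mean(np.exp(1j * np.outer(xi, means)), axis=1)
print(np.max(np.abs(cf_hat - cf_true)))      # small for large n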
On the Complexity of a Practical Primal-Dual Coordinate Method
Alacaoglu, Ahmet, Cevher, Volkan, Wright, Stephen J.
The various methods that have been proposed for (1.1) have favorable complexity guarantees in certain special cases. The plethora of methods and results makes it difficult for both theoreticians and practitioners to choose the method best suited to particular instances of (1.1). In this paper, we focus on improving the theory for an existing method, the PURE-CD algorithm described in [2]. We show that this method achieves or improves on the best-known complexity results for interesting special cases of (1.1). The state-of-the-art results are currently dispersed across different methods.
Layer-Peeled Model: Toward Understanding Well-Trained Deep Neural Networks
Fang, Cong, He, Hangfeng, Long, Qi, Su, Weijie J.
In this paper, we introduce the Layer-Peeled Model, a nonconvex yet analytically tractable optimization program, in a quest to better understand deep neural networks that are trained for a sufficiently long time. As the name suggests, this new model is derived by isolating the topmost layer from the remainder of the neural network, followed by imposing certain constraints separately on the two parts. We demonstrate that the Layer-Peeled Model, albeit simple, inherits many characteristics of well-trained neural networks, thereby offering an effective tool for explaining and predicting common empirical patterns of deep learning training. First, when working on class-balanced datasets, we prove that any solution to this model forms a simplex equiangular tight frame, which in part explains the recently discovered phenomenon of neural collapse in deep learning training [PHD20]. Moreover, when moving to the imbalanced case, our analysis of the Layer-Peeled Model reveals a hitherto unknown phenomenon that we term Minority Collapse, which fundamentally limits the performance of deep learning models on the minority classes. In addition, we use the Layer-Peeled Model to gain insights into how to mitigate Minority Collapse. Interestingly, this phenomenon is first predicted by the Layer-Peeled Model before its confirmation by our computational experiments.
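As a quick illustration of the geometry the abstract refers to, the short sketch below constructs a $K$-class simplex equiangular tight frame and checks its defining property (unit-norm vectors whose pairwise inner products all equal $-1/(K-1)$); the choice $K = 10$ is illustrative.

import numpy as np

K = 10
M = np.sqrt(K / (K - 1)) * (np.eye(K) - np.ones((K, K)) / K)  # columns are the ETF vectors
G = M.T @ M                                                   # Gram matrix
print(np.allclose(np.diag(G), 1.0))                           # unit norms
off_diag = G[~np.eye(K, dtype=bool)]
print(np.allclose(off_diag, -1.0 / (K - 1)))                  # equal pairwise angles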
- North America > United States > Pennsylvania (0.04)
- Asia > Middle East > Jordan (0.04)
Feature Purification: How Adversarial Training Performs Robust Deep Learning
Allen-Zhu, Zeyuan, Li, Yuanzhi
Despite the empirical success of using Adversarial Training to defend deep learning models against adversarial perturbations, it remains rather unclear what the principles are behind the existence of adversarial perturbations, and what adversarial training does to the neural network to remove them. In this paper, we present a principle that we call Feature Purification, where we show that one of the causes of the existence of adversarial examples is the accumulation of certain small dense mixtures in the hidden weights during the training process of a neural network; and more importantly, one of the goals of adversarial training is to remove such mixtures to purify hidden weights. We present both experiments on the CIFAR-10 dataset to illustrate this principle, and a theoretical result proving that for certain natural classification tasks, training a two-layer neural network with ReLU activation using randomly initialized gradient descent indeed satisfies this principle. Technically, we give, to the best of our knowledge, the first result proving that the following two can hold simultaneously for training a neural network with ReLU activation. (1) Training over the original data is indeed non-robust to small adversarial perturbations of some radius. (2) Adversarial training, even with an empirical perturbation algorithm such as FGM, can in fact be provably robust against ANY perturbations of the same radius. Finally, we also prove a complexity lower bound, showing that low-complexity models such as linear classifiers, low-degree polynomials, or even the neural tangent kernel for this network, CANNOT defend against perturbations of this same radius, no matter what algorithms are used to train them.
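For concreteness, here is a minimal sketch of one FGM perturbation step, the kind of empirical attack the abstract mentions inside adversarial training; the two-layer ReLU network, $\ell_2$ radius, and synthetic batch are illustrative placeholders of ours, not the paper's experimental setup.

import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
loss_fn = nn.CrossEntropyLoss()
x = torch.randn(32, 784)
y = torch.randint(0, 10, (32,))
radius = 0.1

x_adv = x.clone().requires_grad_(True)
loss = loss_fn(net(x_adv), y)
grad, = torch.autograd.grad(loss, x_adv)
# FGM: move each example by a fixed l2 radius along its gradient direction.
delta = radius * grad / (grad.norm(dim=1, keepdim=True) + 1e-12)
x_adv = (x + delta).detach()
# Adversarial training would now take an optimizer step on loss_fn(net(x_adv), y).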
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > Canada > Quebec > Montreal (0.04)
Model Repair: Robust Recovery of Over-Parameterized Statistical Models
Traditional robust estimation assumes that the data are corrupted, and studies methods of estimation that are immune to these corruptions or outliers in the data. In contrast, we explore the setting where the data are "clean" but a statistical model is corrupted after it has been estimated using the data. We study methods for recovering the model that do not require re-estimation from scratch, using only the design and not the original response values. The problem of model repair is motivated from several different perspectives. First, it can be formulated as a well-defined statistical problem that is closely related to, but different from, traditional robust estimation, and that deserves study in its own right. From a more practical perspective, modern machine learning practice is increasingly working with very large statistical models. For example, artificial neural networks having several million parameters are now routinely estimated. It is anticipated that neural networks having trillions of parameters will be built in the coming years, and that large models will be increasingly embedded in systems, where they may be subject to errors and corruption of the parameter values. In this setting, the maintenance of models in a fault tolerant manner becomes a concern.
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Learning Mixtures of Linear Regressions in Subexponential Time via Fourier Moments
Chen, Sitan, Li, Jerry, Song, Zhao
We consider the problem of learning a mixture of linear regressions (MLRs). An MLR is specified by $k$ nonnegative mixing weights $p_1, \ldots, p_k$ summing to $1$, and $k$ unknown regressors $w_1,...,w_k\in\mathbb{R}^d$. A sample from the MLR is drawn by sampling $i$ with probability $p_i$, then outputting $(x, y)$ where $y = \langle x, w_i \rangle + \eta$, where $\eta\sim\mathcal{N}(0,\varsigma^2)$ for noise rate $\varsigma$. Mixtures of linear regressions are a popular generative model and have been studied extensively in machine learning and theoretical computer science. However, all previous algorithms for learning the parameters of an MLR require running time and sample complexity scaling exponentially with $k$. In this paper, we give the first algorithm for learning an MLR that runs in time which is sub-exponential in $k$. Specifically, we give an algorithm which runs in time $\widetilde{O}(d)\cdot\exp(\widetilde{O}(\sqrt{k}))$ and outputs the parameters of the MLR to high accuracy, even in the presence of nontrivial regression noise. We demonstrate a new method that we call "Fourier moment descent" which uses univariate density estimation and low-degree moments of the Fourier transform of suitable univariate projections of the MLR to iteratively refine our estimate of the parameters. To the best of our knowledge, these techniques have never been used in the context of high dimensional distribution learning, and may be of independent interest. We also show that our techniques can be used to give a sub-exponential time algorithm for learning mixtures of hyperplanes, a natural hard instance of the subspace clustering problem.
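A minimal sketch of the generative model described in the abstract: sample component $i$ with probability $p_i$, then output $(x, y)$ with $y = \langle x, w_i \rangle + \eta$. The dimensions, regressors, and noise rate below are illustrative, and the learning algorithm itself (Fourier moment descent) is not shown.

import numpy as np

rng = np.random.default_rng(1)
k, d, n = 3, 5, 1000
p = np.array([0.5, 0.3, 0.2])                # nonnegative mixing weights summing to 1
W = rng.normal(size=(k, d))                  # unknown regressors w_1, ..., w_k
varsigma = 0.1                               # regression noise rate

comp = rng.choice(k, size=n, p=p)            # latent component labels
X = rng.normal(size=(n, d))                  # covariates
y = np.einsum('nd,nd->n', X, W[comp]) + varsigma * rng.normal(size=n)
# A learner observes only (X, y) and must recover p and W up to small error.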
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.13)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Romania > Sud-Est Development Region > Constanța County > Constanța (0.04)
- Overview (0.67)
- Research Report (0.50)
D-SPIDER-SFO: A Decentralized Optimization Algorithm with Faster Convergence Rate for Nonconvex Problems
Pan, Taoxing, Liu, Jun, Wang, Jie
Decentralized optimization algorithms have attracted intense interest recently, as they have a balanced communication pattern, especially when solving large-scale machine learning problems. The Stochastic Path Integrated Differential Estimator Stochastic First-Order method (SPIDER-SFO) nearly achieves the algorithmic lower bound in certain regimes for nonconvex problems. However, whether we can find a decentralized algorithm that achieves a convergence rate similar to SPIDER-SFO is still unclear. To tackle this problem, we propose a decentralized variant of SPIDER-SFO, called decentralized SPIDER-SFO (D-SPIDER-SFO). We show that D-SPIDER-SFO achieves a gradient computation cost similar to its centralized counterpart -- that is, $O(\epsilon^{-3})$ for finding an $\epsilon$-approximate first-order stationary point. To the best of our knowledge, D-SPIDER-SFO achieves state-of-the-art performance for solving nonconvex optimization problems on decentralized networks in terms of computational cost. Experiments on different network configurations demonstrate the efficiency of the proposed method.
Introduction. Distributed optimization is a popular technique for solving large-scale machine learning problems Li et al. (2014), ranging from visual object recognition Huang et al. (2017); He et al. (2016) to natural language processing Vaswani et al. (2017); Devlin et al. (2019). For distributed optimization, a set of workers forms a connected computational network, and each worker is assigned a portion of the computing task. The centralized network topology, such as a parameter server Jianmin et al. (2016); Dean et al. (2012); Li et al. (2014); Zinkevich et al. (2010), consists of a central worker connected with all other workers.
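For context, the sketch below shows the centralized SPIDER variance-reduced gradient estimator that D-SPIDER-SFO builds on, applied to a toy least-squares problem; the objective, step size, batch size, and refresh period are illustrative, and the decentralized algorithm additionally averages iterates/estimators with network neighbors via a gossip (mixing) matrix, which is not shown here.

import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def grad(x, idx):
    """Minibatch gradient of f(x) = (1/2n) * ||Ax - b||^2 over the rows in idx."""
    Ai = A[idx]
    return Ai.T @ (Ai @ x - b[idx]) / len(idx)

x_prev = x = np.zeros(d)
v = grad(x, np.arange(n))                     # large-batch initialization of the estimator
eta, q, batch = 0.05, 50, 32
for t in range(1, 500):
    if t % q == 0:
        v = grad(x, np.arange(n))             # periodic refresh
    else:
        idx = rng.integers(n, size=batch)
        v = grad(x, idx) - grad(x_prev, idx) + v   # path-integrated correction
    x_prev, x = x, x - eta * v
print(np.linalg.norm(grad(x, np.arange(n))))  # small => near-stationary point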
- Asia > Middle East > Jordan (0.04)
- Asia > China (0.04)
- Education > Focused Education > Special Education (0.44)
- Information Technology (0.34)
Generalization Bounds for Neural Networks via Approximate Description Length
We investigate the sample complexity of networks with bounds on the magnitude of their weights. In particular, we consider the class \[ H=\left\{W_t\circ\rho\circ \ldots\circ\rho\circ W_{1} :W_1,\ldots,W_{t-1}\in M_{d, d}, W_t\in M_{1,d}\right\} \] where the spectral norm of each $W_i$ is bounded by $O(1)$, the Frobenius norm is bounded by $R$, and $\rho$ is the sigmoid function $\frac{e^x}{1+e^x}$ or the smoothened ReLU function $\ln(1+e^x)$. We show that for any depth $t$, if the inputs are in $[-1,1]^d$, the sample complexity of $H$ is $\tilde O\left(\frac{dR^2}{\epsilon^2}\right)$. This bound is optimal up to log-factors, and substantially improves over the previous state of the art of $\tilde O\left(\frac{d^2R^2}{\epsilon^2}\right)$. We furthermore show that this bound remains valid if instead of considering the magnitude of the $W_i$'s, we consider the magnitude of $W_i - W_i^0$, where $W_i^0$ are some reference matrices, with spectral norm of $O(1)$. By taking the $W_i^0$ to be the matrices at the onset of the training process, we get sample complexity bounds that are sub-linear in the number of parameters, in many typical regimes of parameters. To establish our results, we develop a new technique to analyze the sample complexity of families $H$ of predictors. We start by defining a new notion of a randomized approximate description of functions $f:X\to\mathbb{R}^d$. We then show that if there is a way to approximately describe functions in a class $H$ using $d$ bits, then $d/\epsilon^2$ examples suffice to guarantee uniform convergence, namely, that the empirical loss of all the functions in the class is $\epsilon$-close to the true loss. Finally, we develop a set of tools for calculating the approximate description length of classes of functions that can be presented as a composition of linear function classes and non-linear functions.
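A small sketch of a member of the class $H$ from the abstract, using the smoothed ReLU $\rho(x) = \ln(1+e^x)$ and weight matrices rescaled so that spectral norms stay $O(1)$ and Frobenius norms stay at most $R$; the depth, width, and value of $R$ below are illustrative choices of ours.

import numpy as np

rng = np.random.default_rng(0)
d, t, R = 100, 3, 5.0

def smoothed_relu(x):
    return np.log1p(np.exp(x))

def capped(W):
    """Rescale W so its Frobenius norm is at most R (its spectral norm can only shrink)."""
    return W * min(1.0, R / np.linalg.norm(W, 'fro'))

layers = [capped(rng.normal(size=(d, d)) / np.sqrt(d)) for _ in range(t - 1)]  # W_1, ..., W_{t-1}
W_t = capped(rng.normal(size=(1, d)) / np.sqrt(d))                             # top layer W_t

def predict(x):
    for W in layers:
        x = smoothed_relu(W @ x)
    return (W_t @ x).item()

x = rng.uniform(-1.0, 1.0, size=d)            # inputs lie in [-1, 1]^d
print(predict(x))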
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Middle East > Israel > Jerusalem District > Jerusalem (0.04)
- Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
Mildly Overparametrized Neural Nets can Memorize Training Data Efficiently
Ge, Rong, Wang, Runzhe, Zhao, Haoyu
However, it was observed that neural networks are able to fit the training data perfectly, even when the data/labels are randomly corrupted (Zhang et al., 2017). Recently, a series of works (Du et al. (2019); Allen-Zhu et al. (2019c); Chizat and Bach (2018); Jacot et al. (2018), see more references in Section 1.2) developed a theory of neural tangent kernels (NTK) that explains the success of training neural networks through overparametrization. Several results showed that if the number of neurons at each layer is much larger than the number of training samples, networks of different architectures (multilayer/recurrent) can all fit the training data perfectly. However, if one considers the number of parameters required by the current theoretical analysis, these networks are highly overparametrized. Consider fully connected networks, for example. If a two-layer network has a hidden layer with r neurons, the number of parameters is at least rd, where d is the dimension of the input.
- North America (0.14)
- Asia > Middle East > Jordan (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)